Epidemiology is a data science – if only epidemiologists knew

Success stories of modernizing data science practice and education in epidemiology

Konrad H. Stopsack, BIPS
Travis A. Gerke, cStructure
Emily Riederer, Capital One
Malcolm Barrett, Stanford University

Wednesday, June 11, 2025

[Figure: diagram with the labels Data Science, Medicine, Epidemiology, Informatics, Biostatistics, and Health, Population, Computer, and Data Science]

Epidemiologic data science

Caring for our data before we put it into a model or into a paper.

Epidemiologists care about their data

Welcome to this SER session!

Konrad Stopsack (BIPS)           Epidemiologic research   Table creation
Travis Gerke (PCCTC/cStructure)  Clinical trials/Startup  DAG/LLM-based causal workflows
Emily Riederer (Capital One)     Finance                  Project organization and workflows
Malcolm Barrett (Stanford)       Research/programming     Code review

Questions?

Please submit your questions using the Whova App as we go along.

Use Case 1
No more hand-typed numbers – Epidemiologic results tables in 2025

Konrad H. Stopsack

Professor and Chair, Department of Epidemiologic Methods and Etiologic Research
Leibniz Institute for Prevention Research and Epidemiology – BIPS
Bremen, Germany

I have no relevant financial relationships to disclose.

How do our analyses produce results?

  1. Write code in a text editor
  2. Submit code to a server that houses the data
  3. Receive a log file containing the results
  4. Type the results into Word

This process is such a pain
 

  1. It is unclear which data were read in and which code produced which result
  2. Numbers are manually transcribed and rounded
  3. Even the tiniest change to the analysis requires redoing many manual steps

Creating epidemiologic results tables, 2025: Wish list

  1. Descriptive and inferential results should go hand-in-hand
  2. Code should produce a publication-ready table
  3. Sensitivity analyses should be easy and quick

Let us make creating epidemiologic results tables easy, reproducible, and fun!

Example data built into R

data(flchain, package = "survival")

Monoclonal gammopathy of undetermined significance (MGUS) and mortality, Olmsted County, MN, 1995–2003

age  sex  sample_yr  creatinine  MGUS status  futime  death
97   F    1997       1.7         No MGUS      0.23    1
92   F    2000       0.9         No MGUS      3.51    1
94   F    1997       1.4         No MGUS      0.19    1
92   F    1996       1.0         No MGUS      0.31    1
93   F    1996       1.1         No MGUS      2.84    1
90   F    1997       1.0         No MGUS      3.71    1
90   F    1996       0.8         No MGUS      7.81    1
90   F    1999       1.2         No MGUS      1.02    1
93   F    1996       1.2         No MGUS      9.06    1
91   F    1996       0.8         No MGUS      3.63    1
… plus 7864 additional observations.
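
The code on the following slides refers to a variable mgus_factor and to follow-up time in years, neither of which exists in the raw flchain data (which stores a 0/1 variable mgus and follow-up in days). The slides do not show this preparation step. A minimal sketch of how it might look, in base R, is below; the object name flchain_prepared and the recoding itself are assumptions, not the speaker's code, and the age_group variable used later is not reconstructed here.

```r
# Sketch only: derive the analysis variables named on the slides.
# Assumption: mgus_factor is a labeled version of the 0/1 variable `mgus`,
# and futime is converted from days to years.
data(flchain, package = "survival")

flchain_prepared <- transform(
  flchain,
  # 0/1 indicator -> factor with "No MGUS" as the reference level
  mgus_factor = factor(mgus, levels = c(0, 1), labels = c("No MGUS", "MGUS")),
  # follow-up is recorded in days; convert to years
  futime = futime / 365.25
)

head(flchain_prepared[, c("age", "sex", "creatinine", "mgus_factor", "futime", "death")])
```

With this conversion, the first participant's follow-up of 85 days becomes 0.233 years, in line with the data preview above.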

Creating epidemiologic results tables, 2025

library(dplyr)      # mutate(); also re-exports tribble()
library(rifttable)

tribble(
  ~type,         ~label,
  "total",       "Participants",
  "events/time", "Deaths/person-years",
  "cuminc",      "10-year risk",
  "cumincratio", "10-year risk ratio (95% CI)",
  "hr",          "Hazard ratio (95% CI)"
) |> 
  mutate(
    exposure = "mgus_factor",
    event = "death",
    time = "futime",
    arguments = list(list(timepoint = 10))
  ) |> 
  rifttable(data = flchain) |> 
  rt_gt()

MGUS status                  No MGUS        MGUS
Participants                 7759           115
Deaths/person-years          2153/77632     16/1292
10-year risk                 0.24           0.11
10-year risk ratio (95% CI)  1 (reference)  0.45 (0.30, 1.01)
Hazard ratio (95% CI)        1 (reference)  0.44 (0.27, 0.73)

Sensitivity analyses made easy

tribble(
  ~type,          ~label,
  "events/total", "Deaths/participants",
  "hr",           "Hazard ratio (95% CI)"
) |> 
  mutate(
    exposure = "mgus_factor",
    event = "death",
    time = "futime"
  ) |> 
  rifttable(data = flchain) |> 
  rt_gt()

MGUS status            No MGUS        MGUS
Deaths/participants    2153/7759      16/115
Hazard ratio (95% CI)  1 (reference)  0.44 (0.27, 0.73)

tribble(
  ~type,          ~label,                      ~confounders,
  "events/total", "Deaths/participants",       "",
  "hr",           "Hazard ratio (95% CI)",     "",
  "hr",           "  Creatinine-adjusted",     "+ creatinine",
  "hr",           "  Creatinine/age-adjusted", "+ creatinine + age",
) |> 
  mutate(
    exposure = "mgus_factor",
    event = "death",
    time = "futime"
  ) |> 
  rifttable(data = flchain) |> 
  rt_gt()

MGUS status                No MGUS        MGUS
Deaths/participants        2153/7759      16/115
Hazard ratio (95% CI)      1 (reference)  0.44 (0.27, 0.73)
  Creatinine-adjusted      1 (reference)  0.52 (0.32, 0.84)
  Creatinine/age-adjusted  1 (reference)  0.89 (0.54, 1.45)

tribble(
  ~type,          ~label,                ~ci,
  "events/total", "Deaths/participants", NA,
  "",             "Hazard ratio",        NA,
  "hr",           "  with 95% CI",       NA,
  "hr",           "  with 80% CI",       0.8,
  "hr",           "  with 99.5% CI",     0.995
) |> 
  mutate(
    exposure = "mgus_factor",
    event = "death",
    time = "futime",
    confounders = "+ creatinine"
  ) |> 
  rifttable(data = flchain) |> 
  rt_gt()

MGUS status          No MGUS        MGUS
Deaths/participants  2153/7759      16/115
Hazard ratio
  with 95% CI        1 (reference)  0.52 (0.32, 0.84)
  with 80% CI        1 (reference)  0.52 (0.37, 0.71)
  with 99.5% CI      1 (reference)  0.52 (0.26, 1.04)
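
The three intervals share one point estimate and differ only in the multiplier applied to the standard error. As a rough check, and assuming these are Wald-type intervals on the log hazard ratio scale (the slides do not say so), the standard error can be recovered from the rounded 95% interval and reused for the other levels. The helper wald_ci() is hypothetical and not part of rifttable.

```r
# Rough check using the rounded, creatinine-adjusted numbers from the table.
# Assumption: Wald-type intervals, exp(log(HR) +/- z * SE).
hr <- 0.52
ci95 <- c(0.32, 0.84)

# Recover the standard error of log(HR) from the width of the 95% CI
se_log_hr <- (log(ci95[2]) - log(ci95[1])) / (2 * qnorm(0.975))

# Hypothetical helper: Wald interval for a given confidence level
wald_ci <- function(estimate, se, level) {
  z <- qnorm(1 - (1 - level) / 2)
  exp(log(estimate) + c(-1, 1) * z * se)
}

ci80 <- wald_ci(hr, se_log_hr, level = 0.80)
ci995 <- wald_ci(hr, se_log_hr, level = 0.995)
round(rbind(`80%` = ci80, `99.5%` = ci995), 2)
```

Because the inputs are already rounded to two decimals, the results agree with the table only approximately.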

Stratified analyses and regression analyses at once

tribble(
  ~type,          ~label,                    ~stratum,
  "events/total", "Deaths/participants",     "",
  "hr",           "Hazard ratio (95% CI)",   c("F", "M"),
  "",             "*Female*",                "F",
  "events/total", "  Deaths/participants",   "F",
  "hr",           "  Hazard ratio (95% CI)", "F",
  "",             "*Male*",                  "M",
  "events/total", "  Deaths/participants",   "M",
  "hr",           "  Hazard ratio (95% CI)", "M",
) |> 
  mutate(
    exposure = "mgus_factor",
    event = "death",
    time = "futime",
    effect_modifier = "sex"
  ) |> 
  rifttable(data = flchain) |> 
  rt_gt()

MGUS status              No MGUS        MGUS
Deaths/participants      2153/7759      16/115
Hazard ratio (95% CI)    1 (reference)  0.44 (0.27, 0.73)
Female
  Deaths/participants    1154/4282      11/68
  Hazard ratio (95% CI)  1 (reference)  0.55 (0.30, 0.99)
Male
  Deaths/participants    999/3477       5/47
  Hazard ratio (95% CI)  1 (reference)  0.32 (0.13, 0.76)

Different exposure? No problem

tribble(
  ~type,          ~label,                    ~stratum,
  "events/total", "Deaths/participants",     "",
  "hr",           "Hazard ratio (95% CI)",   c("F", "M"),
  "",             "*Female*",                "F",
  "events/total", "  Deaths/participants",   "F",
  "hr",           "  Hazard ratio (95% CI)", "F",
  "",             "*Male*",                  "M",
  "events/total", "  Deaths/participants",   "M",
  "hr",           "  Hazard ratio (95% CI)", "M",
) |> 
  mutate(
    exposure = "age_group",
    event = "death",
    time = "futime",
    effect_modifier = "sex"
  ) |> 
  rifttable(data = flchain) |> 
  rt_gt()

Age at inclusion, years  50-<65         65-80           80+
Deaths/participants      443/4373       1088/2736       638/765
Hazard ratio (95% CI)    1 (reference)  4.4 (4.0, 5.0)  17 (15, 19)
Female
  Deaths/participants    193/2272       528/1538        444/540
  Hazard ratio (95% CI)  1 (reference)  4.5 (3.8, 5.3)  20 (17, 24)
Male
  Deaths/participants    250/2101       560/1198        194/225
  Hazard ratio (95% CI)  1 (reference)  4.6 (4.0, 5.3)  16 (13, 19)

Wilcox AJ. On precision. Epidemiology. 2004;15:1.

rifttable’s data model

The design

design <- tribble(
  ~type,         ~label,
  "total",       "Participants",
  "events/time", "Deaths/person-years",
  "cuminc",      "10-year risk",
  "cumincratio", "10-year risk ratio (95% CI)",
  "hr",          "Hazard ratio (95% CI)"
) |> 
  mutate(
    exposure = "mgus_factor",
    event = "death",
    time = "futime",
    arguments = list(list(timepoint = 10))
  )

The data

# A tibble: 7,874 × 7
     age sex   sample_yr creatinine mgus_factor futime death
   <dbl> <fct>     <dbl>      <dbl> <fct>        <dbl> <dbl>
 1    97 F          1997        1.7 No MGUS      0.233     1
 2    92 F          2000        0.9 No MGUS      3.51      1
 3    94 F          1997        1.4 No MGUS      0.189     1
 4    92 F          1996        1   No MGUS      0.315     1
 5    93 F          1996        1.1 No MGUS      2.84      1
 6    90 F          1997        1   No MGUS      3.71      1
 7    90 F          1996        0.8 No MGUS      7.81      1
 8    90 F          1999        1.2 No MGUS      1.02      1
 9    93 F          1996        1.2 No MGUS      9.06      1
10    91 F          1996        0.8 No MGUS      3.63      1
# ℹ 7,864 more rows

rifttable’s data model

The design

# A tibble: 5 × 6
  type        label                  exposure event time  arguments   
  <chr>       <chr>                  <chr>    <chr> <chr> <list>      
1 total       Participants           mgus_fa… death futi… <named list>
2 events/time Deaths/person-years    mgus_fa… death futi… <named list>
3 cuminc      10-year risk           mgus_fa… death futi… <named list>
4 cumincratio 10-year risk ratio (9… mgus_fa… death futi… <named list>
5 hr          Hazard ratio (95% CI)  mgus_fa… death futi… <named list>

The table produced by rifttable()

rifttable(
  design = design,
  data = flchain
)
# A tibble: 5 × 3
  `MGUS status`               `No MGUS`     MGUS             
  <chr>                       <chr>         <chr>            
1 Participants                7759          115              
2 Deaths/person-years         2153/77632    16/1292          
3 10-year risk                0.24          0.11             
4 10-year risk ratio (95% CI) 1 (reference) 0.45 (0.30, 1.01)
5 Hazard ratio (95% CI)       1 (reference) 0.44 (0.27, 0.73)

The table produced by rifttable()

rifttable(
  design = design,
  data = flchain
) |> 
  rt_gt()

MGUS status                  No MGUS        MGUS
Participants                 7759           115
Deaths/person-years          2153/77632     16/1292
10-year risk                 0.24           0.11
10-year risk ratio (95% CI)  1 (reference)  0.45 (0.30, 1.01)
Hazard ratio (95% CI)        1 (reference)  0.44 (0.27, 0.73)

Great software for analyses and tables exists

  • Excel calculator Episheet
    by Kenneth Rothman

  • R package gtsummary
    by Daniel Sjoberg

  • Many others, of varying scope and quality

Why rifttable


  • Integrates into real data analysis
  • Shows descriptive and inferential statistics side-by-side
  • Makes stratified analyses easy
    (and table 2 fallacies hard)
  • Rounds estimates well
  • Facilitates sensitivity analyses
  • Extends with custom estimators
  • Has a website with copy-paste examples

The IDEFICS/I.Family cohort

The IDEFICS/I.Family cohort: Table 1

rifttable, v0.7.1

Also…

  • Quantiles and quantile regression
  • Ratios of continuous outcomes
  • Competing events
  • Clustering
  • Trends (slopes)
  • Stratified and joint models
  • Inverse probability-weighted estimates
  • Regression-based risk ratios and risk differences for binary outcomes

https://stopsack.github.io/rifttable

Regression models for risk ratios and risk differences

In cohort studies […], there is little reason to consider an odds ratio. While odds ratios are not the primary measure of interest, they are frequently reported because of the statistical model used by the investigators to analyze their data.

Exposure         Unexposed      Exposed
Total            1000           1000
Cases/non-cases  20/980         40/960
Risk             2%             4%
Risk ratio       1 (reference)  2.00 (1.18, 3.4)
Odds ratio       1 (reference)  2.04 (1.20, 3.6)

Exposure         Unexposed      Exposed
Total            1000           1000
Cases/non-cases  960/40         980/20
Risk             96%            98%
Risk ratio       1 (reference)  1.02 (1.01, 1.04)
Odds ratio       1 (reference)  2.04 (1.20, 3.6)
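
The point estimates in these two tables can be reproduced by hand, which makes the contrast explicit: the odds ratio is the same in both scenarios, while the risk ratio is not. The following sketch uses only the counts shown above; the function name rr_or() is made up for illustration.

```r
# Risk ratio and odds ratio from the two 2x2 tables (1000 people per group)
rr_or <- function(cases_exposed, cases_unexposed, n = 1000) {
  risk1 <- cases_exposed / n
  risk0 <- cases_unexposed / n
  c(
    risk_ratio = risk1 / risk0,
    odds_ratio = (risk1 / (1 - risk1)) / (risk0 / (1 - risk0))
  )
}

rare <- rr_or(cases_exposed = 40, cases_unexposed = 20)      # risks 4% vs. 2%
common <- rr_or(cases_exposed = 980, cases_unexposed = 960)  # risks 98% vs. 96%
round(rbind(rare, common), 2)
```

With a rare outcome the odds ratio (2.04) is close to the risk ratio (2.00); with a common outcome the same odds ratio corresponds to a risk ratio of only 1.02.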

Why are we still reporting odds ratios in cross-sectional and cohort studies?

  1. Log-binomial models do not converge
data(breastcancer, package = "risks")
glm(
  formula = death ~ stage + receptor, 
  data = breastcancer, 
  family = binomial(link = "log")
)
Error: no valid set of coefficients has been found: please supply starting values
  2. “Poisson” models with robust variance often do not converge either
  3. Marginal standardization/g-computation/g-formula with time-fixed exposure works whenever a logistic model can be fit
  4. And there is Miettinen’s case-duplication approach
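
To make the marginal standardization idea concrete, here is a minimal sketch in base R on simulated data: fit a logistic model, predict everyone's risk with the exposure set to each level, average, and contrast. The data, variable names, and effect sizes are invented for illustration. This is not the implementation used by the risks package, and it gives point estimates only; confidence intervals would additionally need the delta method or a bootstrap.

```r
# Sketch of marginal standardization (g-computation) for a binary outcome
# and a time-fixed binary exposure. All data below are simulated.
set.seed(42)
n <- 5000
confounder <- rbinom(n, size = 1, prob = 0.4)
exposure <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * confounder))
outcome <- rbinom(n, size = 1, prob = plogis(-2 + 0.7 * exposure + 1.0 * confounder))
dat <- data.frame(outcome, exposure, confounder)

# Step 1: fit an ordinary logistic model, which converges in most settings
fit <- glm(outcome ~ exposure + confounder, data = dat, family = binomial())

# Step 2: predict each person's risk with exposure set to 1, then to 0,
# keeping their observed confounder values, and average over the sample
risk_exposed <- mean(
  predict(fit, newdata = transform(dat, exposure = 1), type = "response")
)
risk_unexposed <- mean(
  predict(fit, newdata = transform(dat, exposure = 0), type = "response")
)

# Step 3: contrast the standardized risks
rr <- risk_exposed / risk_unexposed
rd <- risk_exposed - risk_unexposed
c(risk_ratio = rr, risk_difference = rd)
```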

We no longer need to report odds ratios in cross-sectional and cohort studies

library(risks)
summary(
  riskratio(
    formula = death ~ stage + receptor, 
    data = breastcancer
  )
)

Risk ratio model, fitted via marginal standardization of a logistic model with delta method (margstd_delta).
Call:
stats::glm(formula = death ~ stage + receptor, family = binomial(link = "logit"), 
    start = "(no starting values)")

Coefficients: (3 not defined because of singularities)
               Estimate Std. Error z value Pr(>|z|)    
stageStage I     0.0000     0.0000     NaN      NaN    
stageStage II    0.8989     0.3875   2.320   0.0203 *  
stageStage III   1.8087     0.3783   4.781 1.75e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 228.15  on 191  degrees of freedom
Residual deviance: 185.88  on 188  degrees of freedom
AIC: 193.88

Number of Fisher Scoring iterations: 4

Confidence intervals for coefficients: (delta method)
                   2.5 %   97.5 %
stageStage I   0.0000000 0.000000
stageStage II  0.1395299 1.658324
stageStage III 1.0671711 2.550242
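
The estimates and confidence limits printed above appear to be on the log scale: exponentiating them gives the risk ratios that rifttable reports for the same model on the next slide. A quick check with base R, using only numbers copied from the output:

```r
# Values copied from the printed summary (log risk ratios and their limits)
log_rr <- rbind(
  "Stage II"  = c(estimate = 0.8989, lower = 0.1395299, upper = 1.658324),
  "Stage III" = c(estimate = 1.8087, lower = 1.0671711, upper = 2.550242)
)

# Back-transform to the risk ratio scale
rr <- exp(log_rr)
rr
```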

Estimating and reporting adjusted risk ratios and risk differences made easy

tribble(
  ~type,            ~label,           
  "outcomes/total", "Deaths/total",
  "risk",           "Risk",    
  "rr",             "Risk ratio (95% CI)",
  "rd",             "Risk difference (95% CI)",
) |> 
  mutate(
    outcome = "death",
    exposure = "stage",
    confounders = "+ receptor"
  ) |> 
  rifttable(data = breastcancer) |> 
  rt_gt() |> 
  gt::tab_footnote(footnote = "Adjusted for hormone receptor status.")

Stage                     Stage I        Stage II           Stage III
Deaths/total              7/67           26/96              21/29
Risk                      0.10           0.27               0.72
Risk ratio (95% CI)       1 (reference)  2.46 (1.15, 5.3)   6.1 (2.91, 13)
Risk difference (95% CI)  0 (reference)  0.16 (0.05, 0.28)  0.57 (0.38, 0.77)
Adjusted for hormone receptor status.

Compare methods if you like – in practice, simply use g-computation for RRs and RDs

tribble(
  ~type,            ~label,                               ~arguments,
  "outcomes/total", "Deaths/total",                       NA,
  "risk",           "Risk",                               NA,
  "",               "Risk ratio (95% CI)",                NA,
  "rr",             "  *g*-computation, delta method CI", list(approach = "margstd_delta"),
  "rr",             "  *g*-computation, bootstrap CI",    list(approach = "margstd_boot"),
  "rr",             "  Poisson model, robust CI",         list(approach = "robpoisson"),
  "rr",             "  Case-duplication approach",        list(approach = "duplicate"),
  "rd",             "Risk difference (95% CI)",           NA,
) |> 
  mutate(
    outcome = "death",
    exposure = "stage",
    confounders = "+ receptor"
  ) |> 
  rifttable(data = breastcancer) |> 
  rt_gt() |> 
  gt::tab_footnote(footnote = "Adjusted for hormone receptor status.")

Stage                             Stage I        Stage II           Stage III
Deaths/total                      7/67           26/96              21/29
Risk                              0.10           0.27               0.72
Risk ratio (95% CI)
  g-computation, delta method CI  1 (reference)  2.46 (1.15, 5.3)   6.1 (2.91, 13)
  g-computation, bootstrap CI     1 (reference)  2.46 (1.09, 5.9)   6.1 (3.0, 15)
  Poisson model, robust CI        1 (reference)  2.52 (1.17, 5.4)   5.9 (2.78, 13)
  Case-duplication approach       1 (reference)  2.52 (1.16, 5.5)   5.9 (2.79, 13)
Risk difference (95% CI)          0 (reference)  0.16 (0.05, 0.28)  0.57 (0.38, 0.77)
Adjusted for hormone receptor status.

risks, v0.4.3

https://stopsack.github.io/risks

Summary: Epidemiologic results tables in 2025

  • The tables in our manuscripts can and should be made directly by software

  • Several R packages create various types of tables

    • The rifttable package is designed specifically for epidemiologists and with best-practice result reporting in mind
    • One perk is regression modeling of adjusted risk ratios and risk differences for binary outcomes, provided by the risks package
  • This automation makes table creation easier, faster, and reproducible

    • We can focus on thinking about our results instead of transcribing numbers

Thank you for your attention – Please reach out

stopsack@leibniz-bips.de

We are hiring!

The Leibniz Institute for Prevention Research and Epidemiology – BIPS is one of Germany’s largest and oldest research institutes focused on epidemiology.

Applications are open now for our PhD Program in Epidemiology, Statistics, and Prevention and Implementation Science, with an October start date.


https://tinyurl.com/bips-phd-program


Feel free to reach out about this opportunity and others.

stopsack@leibniz-bips.de